use polars for seaexplorer data file load #120
Conversation
This is failing some tests at the moment. Some of those are expected (they use intermediate netCDFs, which are now parquet files); others look like errors with timestamp parsing and a few other issues. I'll work on these.
Just got to resolve the optional sensor-specific coarsening now.
I wonder if this needs to be its own method? I'd need to be convinced that parquet files are a better intermediate format than netcdf. We need to be able to look at the intermediate files to see what is in the raw data - what is the workflow for parquet files? Just polars? I don't think xarray handles them. Does the efficiency go away if you use netcdf and polars? We recently had this issue with the raw Alseamar files and solved it by subsampling the redundant information, which reduced the raw files from 10 MB each to 40 kB.
These are good points, thanks Jody. We've tried subsampling the raw Alseamar files, but with the 16 Hz legato running for three-week missions we still end up with huge datasets, and the load/merge step is taking several hours per dataset. I'm starting to profile the code now. polars saves time over pandas when loading and merging the data. parquet files are quicker and more storage-efficient than netCDFs to write and read, but they do need to be readable as intermediate products. This is achieved in a very similar way to pandas:
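For example (a minimal sketch; the file name below is hypothetical, and pandas needs pyarrow or fastparquet installed for its parquet support):

import pandas as pd
import polars as pl

# hypothetical intermediate file written by the load/merge step
fname = "rawnc/sea057.M23.pld.parquet"

# polars, as used internally
df_pl = pl.read_parquet(fname)

# pandas, so no polars knowledge is needed to inspect the intermediate files
df_pd = pd.read_parquet(fname)
print(df_pd.head())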
To me this makes more sense: at the load/merge stage there is no metadata that would make the bulkier netCDFs necessary for these intermediate table-like files. I'm working on a full demo at the moment. I'd like to make use of this speedup, but if it's incompatible with existing workflows I'll make it a separate function or just keep it for internal use at VOTO.
I've put together a rough notebook profiling the polars implementation against the current main branch. It shows the performance differences using sub and raw data, as well as what the intermediate parquet products look like when loaded in polars: https://github.com/callumrollo/pyglider_profile/blob/main/profile_pyglider.ipynb It's pretty ugly, but it shows the main points. I can work to make it portable/repeatable if desired.
OK, 5-10x is a big deal if you have 24 Hz data. If you do this, can you update the docs to explain what a parquet file is and a quick how-to on opening them? I think we can assume pandas and xarray knowledge (we are writing to netcdf), but parquet probably needs a paragraph of explanation. Final question would be whether parquet and pandas have any packaging issues? Does parquet work with a
Would Zarr be a possible format option for pyglider?
I've had a look and it turns out pandas is able to read and write parquet files, so users shouldn't ever need to interact with polars. I'll add a paragraph to the docs explaining it. I've not encountered problems adding polars to the build so far, but I'll test it out on a Windows machine today to be sure. polars has mature builds on PyPI and conda-forge.
No issues making conda environments with polars. I'm testing it on a range of our SeaExplorer datasets now. I'll mark this ready for review once I'm satisfied it's performing well in production.
I've tested this pretty extensively on our datasets now. I think it's good to go. @jklymak do you have any other requested changes?
Give me a day or two to look it over. Probably is all good. |
This mostly looks good, just a couple of style things, and we probably need to change a top-level name.
I think we need to decide whether we have enough users yet to need a deprecation cycle for removing things. I think we are probably fine to do this now, but at some point we can't change a bunch of underlying code that changes the output file type(!) without a deprecation cycle. In that case, code like this would be better as a new API, with the old API kept around for users who need it, with a deprecation warning.
try:
    out = pl.read_csv(f, sep=';')
except:
    _log.warning(f'Could not read {f}')
Do we need to add the badfiles here?
badfiles added
except:
    _log.warning(f'Could not read {f}')
    continue
if "Timestamp" in out.columns:
Can this get a comment? Why do we need this check?
This check is for corrupted files. I encountered this once in ~50 missions I've processed so far. I've added a comment.
if ftype == 'gli':
    outx.to_netcdf(fnout[:-3] + '.nc', 'w')
if rawsub == 'raw' and dropna_subset is not None:
    out = out.with_column(out.select(pl.col(dropna_subset).is_null().cast(pl.Int64))
This needs a comment, and if it doesn't cause a slowdown or extra writes it would maybe benefit from being unpacked into separate calls. Looks like you are dropping repeats, but it's too many steps in one line for me to follow without spending too much time ;-)
I've added a comment on this. It looks a bit convoluted, but it's just the polars equivalent of pandas.dropna. I didn't want to change the functionality of the dropna option in this PR. We can factor it out in a separate PR though?
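For reference, a sketch of the idea (not the PR's exact code, which counts nulls across the subset with a threshold; column names here are hypothetical, and it assumes a recent polars with all_horizontal):

import polars as pl

# hypothetical columns standing in for the dropna_subset used in the PR
dropna_subset = ["GPCTD_TEMPERATURE", "GPCTD_DOF"]
out = pl.DataFrame({
    "time": [1, 2, 3],
    "GPCTD_TEMPERATURE": [9.5, None, 9.7],
    "GPCTD_DOF": [210.0, None, None],
})
# drop rows where every column in the subset is null,
# similar to pandas' out.dropna(subset=dropna_subset, how="all")
out = out.filter(~pl.all_horizontal(pl.col(c).is_null() for c in dropna_subset))
print(out)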
pyglider/seaexplorer.py
Outdated
    post_1971 = df.filter(pl.col("time") > dt_1971)
    if len(post_1971) == len(df):
        return post_1971
    return df.filter(pl.col("time") > dt_1971)


def merge_rawnc(indir, outdir, deploymentyaml, incremental=False, kind='raw'):
We should probably rename this?
I've renamed this to the more descriptive drop_pre_1971_samples
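A sketch of what the renamed helper might look like (signature and body are assumptions based on the diff above, not copied from the merged code):

import datetime
import polars as pl

def drop_pre_1971_samples(df: pl.DataFrame) -> pl.DataFrame:
    # timestamps before 1971 presumably come from an unset glider clock
    dt_1971 = datetime.datetime(1971, 1, 1)
    return df.filter(pl.col("time") > dt_1971)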
I meant rename merge_rawnc... Though that screws with older scripts, I'd rename this merge_parquet or whatever, and then alias merge_rawnc so old scripts still work.
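One way to do that (a sketch; the PR ultimately adds a plain merge_rawnc = merge_parquet alias, and the DeprecationWarning here is an optional extra in the spirit of the deprecation-cycle discussion above):

import warnings

def merge_parquet(indir, outdir, deploymentyaml, incremental=False, kind='raw'):
    ...  # body omitted in this sketch

def merge_rawnc(*args, **kwargs):
    # keep the old name working for existing scripts, with a gentle nudge
    warnings.warn("merge_rawnc has been renamed to merge_parquet",
                  DeprecationWarning, stacklevel=2)
    return merge_parquet(*args, **kwargs)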
I've added some more comments and used badfiles if a file fails to load. I have no idea how these changes have caused a couple of the Slocum tests to start failing.
OK, I think we're good to go now. @jklymak has this met your requested changes? In future I'll make these kinds of changes as part of a deprecation cycle, as you suggest.
Looks close, just a couple more places to clarify the code.
val = _interp_gli_to_pld(sensor_sub, sensor, val2, indctd)
coarse_ints = np.arange(0, len(sensor)/coarsen_time, 1/coarsen_time).astype(int)
sensor_sub = sensor.with_columns(pl.lit(coarse_ints).alias("coarse_ints"))
sensor_sub_grouped = sensor_sub.with_column(
This block needs a comment as well... It says "smooth" oxygen above, but I'm not following this dance with coarse_ints etc.
Ah, that makes more sense! I've renamed to merge_parquet and added merge_rawnc = merge_parquet so hopefully older scripts will still work. I've also added more comments to the variable-coarsening code.
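For context, a rough sketch of the coarsening idea discussed above (variable names and the 16-samples-per-bin figure are assumptions, not the merged code, and it assumes a recent polars with group_by): each row gets an integer bin index, then values are averaged within each bin so a fast sensor is reduced before interpolation onto the CTD time base.

import numpy as np
import polars as pl

coarsen_time = 16  # assumed: raw samples per coarsened sample
sensor = pl.DataFrame({
    "time": np.arange(64, dtype=float),
    "oxygen": np.random.default_rng(0).random(64),
})
# integer bin index for each row: 0,0,...,0,1,1,...  (16 rows per bin here)
coarse_ints = np.arange(0, len(sensor) / coarsen_time, 1 / coarsen_time).astype(int)
coarsened = (
    sensor.with_columns(pl.Series("coarse_ints", coarse_ints))
          .group_by("coarse_ints")
          .mean()
          .sort("coarse_ints")
)
print(coarsened)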
@callumrollo feel free to self-merge; perhaps squash commits that you don't need, or just squash on merge. Thanks! We are about to try this out on some datasets that have been giving us problems. Ping @hvdosser
We have at least one dataset that has 16 Hz legato data (plus 1 Hz GPCTD), which would probably benefit from this change. Pinging @clayton to try this out on that mission.
@richardsc @clayton This is now merged and on master. It's not yet released, so you will need to install from the development branch; let us know if that is problematic. If it gets through a few more glider setups we should make a release, along with the changes to the Slocum processing.
This is a big PR. More work is needed to check that it isn't changing the data, as the methods are not identical to pandas'.
Testing on our datasets has shown a ~10x speedup when processing large delayed-mode datasets, and a ~5x speedup with NRT data. I'll write some example code comparing the two methods, along with more tests. This is almost certainly a sub-optimal implementation of polars, as I'm directly aping the existing flow for pandas.
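In the meantime, a rough sketch of the kind of load-time comparison meant here (the file name is hypothetical, and note that recent polars spells the keyword separator= where older releases used sep=):

import time
import pandas as pd
import polars as pl

fname = "sea057.12.pld1.raw.10"  # a semicolon-separated SeaExplorer payload file

t0 = time.perf_counter()
df_pd = pd.read_csv(fname, sep=";")
t1 = time.perf_counter()
df_pl = pl.read_csv(fname, separator=";")
t2 = time.perf_counter()
print(f"pandas: {t1 - t0:.2f} s  polars: {t2 - t1:.2f} s")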
Using polars also decreases the disk usage of intermediate products by using parquet rather than netCDF. It also has substantially lower memory usage than pandas, so it should decrease overheads when processing large datasets.
This is designed to resolve #36.